Background Information

The purpose of our analysis involves exploring potential methods of increasing ridership for the DC Captial Bikeshare program. This motivation in focusing on ridership aligns with the DC Bikeshare’s main mission in transforming the Washingtion Metropolitan Area and residing communities by providing an affordable bicycle transit system. The goals outlined in this targeted reformation includes decreasing traffic congestion, promoting health and wellness, decreasing air pollution, and expanding transportation options. The DC Biekshare program is owned by several jurisdictions however, is mainly managed by the DC local government.

More information pertaining to the bikeshare program can be found by following this link: https://capitalbikeshare.com/

Business Understanding

Following the methods outline in the CRISP-DM Methodology, we initially focused on understanding the various business aspects to our problem before analyzing any data. We determine our business objectives by researching the background of the DC Bikeshare program, as seen above. Next, we determine our data mining goals and identified a overall question we aim to solve for our analysis.

This question pertains to the following: How do we make it safer for people to ride bikes as a means of increasing ridership?

Baseline Packages

Baseline packages for our analysis:

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
## Warning: package 'janitor' was built under R version 4.3.2
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(lubridate) # because we will probably see some dates
library(here) # more easily access files in your project
## Warning: package 'here' was built under R version 4.3.2
## here() starts at U:/DS241/Final-Bikeshare-Project_DS241_Team-4
library(mapview)
## Warning: package 'mapview' was built under R version 4.3.2
library(gbfs)
## Warning: package 'gbfs' was built under R version 4.3.2
library(sf) # working with simple features - geospatial
## Warning: package 'sf' was built under R version 4.3.2
## Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
library(tmap)
## Warning: package 'tmap' was built under R version 4.3.2
## Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
## remotes::install_github('r-tmap/tmap')
library(tidycensus)
## Warning: package 'tidycensus' was built under R version 4.3.2
library(stringr)
library(raster)
## Warning: package 'raster' was built under R version 4.3.2
## Loading required package: sp
## Warning: package 'sp' was built under R version 4.3.2
## 
## Attaching package: 'raster'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(readr)
library(maps)
## Warning: package 'maps' was built under R version 4.3.2
## 
## Attaching package: 'maps'
## 
## The following object is masked from 'package:purrr':
## 
##     map
library(spData)
## Warning: package 'spData' was built under R version 4.3.2
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
library(viridis)
## Warning: package 'viridis' was built under R version 4.3.2
## Loading required package: viridisLite
## 
## Attaching package: 'viridis'
## 
## The following object is masked from 'package:maps':
## 
##     unemp
library(ggridges)
## Warning: package 'ggridges' was built under R version 4.3.2
library(ggbeeswarm)
## Warning: package 'ggbeeswarm' was built under R version 4.3.2

Accessing and Loading the Databases

We are initially looking at the following three data sets:

  1. The DC Bikeshare System Data for September 2023: https://capitalbikeshare.com/system-data
  2. The DC Protected vs Unprotected Bikelane Dataset: https://opendata.dc.gov/datasets/bicycle-lanes/explore
  3. The DC Crimes and Accidents Dataset: https://opendata.dc.gov/datasets/crashes-in-dc/explore
bikes_september <- read_csv(here("data_raw","202309-capitalbikeshare-tripdata.csv")) %>% clean_names()
## Rows: 450090 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bikes_lanes = st_read(here("data_raw", "Bicycle_Lanes.geojson")) %>% clean_names()
## Reading layer `Bicycle_Lanes' from data source 
##   `U:\DS241\Final-Bikeshare-Project_DS241_Team-4\data_raw\Bicycle_Lanes.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 2275 features and 26 fields
## Geometry type: LINESTRING
## Dimension:     XYZ
## Bounding box:  xmin: -77.08773 ymin: 38.82359 xmax: -76.93066 ymax: 38.98276
## z_range:       zmin: 0 zmax: 0
## Geodetic CRS:  WGS 84
bikes_crashes = st_read(here("data_raw", "Crashes_in_DC.geojson")) %>% clean_names()
## Reading layer `Crashes_in_DC' from data source 
##   `U:\DS241\Final-Bikeshare-Project_DS241_Team-4\data_raw\Crashes_in_DC.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 67621 features and 56 fields
## Geometry type: POINT
## Dimension:     XYZ
## Bounding box:  xmin: -77.11425 ymin: -9.000001 xmax: 77.00682 ymax: 41.25995
## z_range:       zmin: 0 zmax: 0
## Geodetic CRS:  WGS 84
csv_bikes_lanes = read_csv(here("data_raw", "Bicycle_Lanes.csv")) %>% clean_names()
## Rows: 2275 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): ROUTENAME, SUBBLOCKKEY, BIKELANE_PARKINGLANE_ADJACENT, BIKELANE_TH...
## dbl  (8): ROUTEID, ROADTYPE, QUADRANT, TOTALBIKELANES, TOTALBIKELANEWIDTH, W...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
csv_bikes_crashes = read_csv(here("data_raw", "Crashes_in_DC.csv")) %>% clean_names()
## Rows: 67621 Columns: 58
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): REPORTDATE, ROUTEID, FROMDATE, ADDRESS, WARD, EVENTID, MAR_ADDRESS...
## dbl (45): X, Y, OBJECTID, CRIMEID, CCN, MEASURE, OFFSET, STREETSEGID, ROADWA...
## lgl  (1): TODATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dc_shape = st_read(here("data_raw","DC_Health_Planning_Neighborhoods.geojson")) %>% clean_names()
## Reading layer `DC_Health_Planning_Neighborhoods' from data source 
##   `U:\DS241\Final-Bikeshare-Project_DS241_Team-4\data_raw\DC_Health_Planning_Neighborhoods.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 51 features and 8 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -77.11976 ymin: 38.79165 xmax: -76.9094 ymax: 38.99556
## Geodetic CRS:  WGS 84

Checking the Class of Each Data Frame

class(bikes_lanes)
## [1] "sf"         "data.frame"
class(bikes_crashes)
## [1] "sf"         "data.frame"
class(dc_shape)
## [1] "sf"         "data.frame"
class(csv_bikes_crashes)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
class(csv_bikes_lanes)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

Initial Map of the Bike Lane Data

mapview(dc_shape, lwd=3, alpha=0.2) + 
  mapview(bikes_lanes, zcol = "streetname", lwd = 3, layer.name = "bike_lanes", legend = FALSE) 

Filtering the Crash Data

Filtering the crash data. First, we looked at variables that we are interested in analyzing from the total data set. We then filtered the data to only look at accidents relating the bicycles. This was determined by setting a requirement that there must be at least 1 bicycle involved in the accident. Columns not relevant to our analysis was removed from the data set.

df1_bike_crashes = csv_bikes_crashes %>%
   filter(total_bicycles >= "1") %>%
  dplyr::select(-c(locationerror,objectid, crimeid, ccn, reportdate,routeid, streetsegid, roadwaysegid, todate,eventid, majorinjuries_driver, majorinjuries_pedestrian, majorinjuriespassenger,minorinjuries_driver, minorinjuries_pedestrian, minorinjuriespassenger,fatal_driver, fatal_pedestrian, fatalpassenger,unknowninjuries_driver,unknowninjuries_bicyclist,unknowninjuries_pedestrian,unknowninjuriespassenger, mar_address,mar_score,mar_id,total_vehicles,total_pedestrians,total_taxis,total_government,pedestriansimpaired,driversimpaired, speeding_involved, lastupdatedate)) 

Converting the Crash Data into a Sf object for Plotting

sf_bike_crashes <- df1_bike_crashes %>%
  st_as_sf(coords = c("longitude","latitude"), crs = 4326) %>%
  st_cast("POINT")

class(sf_bike_crashes)
## [1] "sf"         "tbl_df"     "tbl"        "data.frame"

Plotting the Bike Lanes and Crash Data Together

mapview(dc_shape, lwd=3) + 
  mapview(bikes_lanes, zcol = "streetname", lwd = 3, layer.name = "bike_lanes", legend = FALSE) +
  mapview(sf_bike_crashes, color = "red", lwd = 0.5, layer.name = "bike_crashes", legend = FALSE)

Filtering the Data Further

We will pivot longer the crash data to reduce the number of variables and increase the number of observations. The data is originally separated column-wise based on the type of injuries however, we aim to create a single column that lists the injuries as observations instead. Also, since each type of injury was listed for every street name, any values equivalent to 0, was dropped.

df1_pivot_bike_crashes = df1_bike_crashes %>%
  pivot_longer(cols=c("majorinjuries_bicyclist","minorinjuries_bicyclist","fatal_bicyclist"),
               names_to = "injuries",
               values_to = "number_of_injuries") %>%
  dplyr::select(-c(bicyclistsimpaired)) %>%
  relocate(c(injuries, number_of_injuries), .after = ward) %>%
  mutate(injuries = recode(injuries, majorinjuries_bicyclist = "major",
  minorinjuries_bicyclist = "minor",
  fatal_bicyclist = "fatal" )) %>%
  filter(number_of_injuries >= "1")

Next, we filter the bike crash data to only show bicycle related crashes in September 2023. We chose this time period specifically because the ridership data is only applicable to this time range as well.

Sept_df1_pivot_bike_crashes = df1_pivot_bike_crashes %>%
 filter(grepl("2023/09", fromdate))

We will move on to filtering the bike lane data. Similarly for the crash data, we will also pivot longer this set to create a single column that lists out the different type of bike lanes into one column. We wanted columns that separated the “adjacent_to_bikelane” and “protection_of_lane” observations thus, we had to pivot the table twice.

We will not look at the contraflow lane as it does not fit into the data we would like to look at, which includes the location of the bike lane adjacent to motorized vehicles lanes and the design of the lane that provide protection. Contraflow only deals with the direction the lane is flowing in relations to traffic. Also, there is significant NA in the observations, which we removed.

df1_pivot_bike_lanes = bikes_lanes %>%
  pivot_longer(col=c("bikelane_parkinglane_adjacent","bikelane_throughlane_adjacent","bikelane_pocketlane_adjacent"),
  names_to = "adjacent_to_bikelane",
  values_to = "direction_of_lane_adjacent") %>%
pivot_longer(col=c("bikelane_conventional","bikelane_dual_buffered","bikelane_protected", "bikelane_buffered"),
  names_to = "protection_of_bikelane",
  values_to = "direction_of_lane_protection") %>%
  dplyr::select(-bikelane_contraflow) %>%
  relocate(c(adjacent_to_bikelane,direction_of_lane_adjacent, protection_of_bikelane, direction_of_lane_protection), .after = streettype) %>%
  mutate(adjacent_to_bikelane = recode(adjacent_to_bikelane, bikelane_throughlane_adjacent = "throughlane", bikelane_pocketlane_adjacent = "pocketlane", bikelane_parkinglane_adjacent = "parkinglane")) %>%
  mutate(protection_of_bikelane = recode(protection_of_bikelane, bikelane_protected = "protected", bikelane_buffered = "buffered", bikelane_conventional = "conventional", bikelane_dual_protected = "dual_protected", bikelane_dual_buffered = "dual_buffered")) %>%
   drop_na(c(direction_of_lane_adjacent, direction_of_lane_protection))

Plotting the Newly Filter Data Sets

Since we filter the data sets as df objects, we have to make sure both sets are set to sf objects before plotting again.

class(df1_pivot_bike_lanes)
## [1] "sf"         "tbl_df"     "tbl"        "data.frame"
class(Sept_df1_pivot_bike_crashes)
## [1] "tbl_df"     "tbl"        "data.frame"

The crash data needs to be set to a sf object.

sf_Sept_df1_pivot_bike_crashes <- Sept_df1_pivot_bike_crashes %>%
  st_as_sf(coords = c("longitude","latitude"), crs = 4326) %>%
  st_cast("POINT")

sf_df1_pivot_bike_crashes <- df1_pivot_bike_crashes%>%
   st_as_sf(coords = c("longitude","latitude"), crs = 4326) %>%
  st_cast("POINT")

class(sf_Sept_df1_pivot_bike_crashes)
## [1] "sf"         "tbl_df"     "tbl"        "data.frame"

Now, that both sets are in the correct classification, we will attempt to plot these sets.

First, we will look at the different classification of the bike lanes, mainly the protection_of_bikelane variable.

Defining a palette for the maps

pal = mapviewPalette("mapviewSpectralColors")
mapview(dc_shape, lwd=2, alpha = 0.5, alpha.regions = 0.2) + 
  mapview(df1_pivot_bike_lanes, zcol = "protection_of_bikelane", burst = TRUE, lwd = 3, alpha = 0.8, layer.name = "bike_lanes", legend = TRUE)
mapview(dc_shape, lwd=2, alpha = 0.5, alpha.regions = 0.2) + 
  mapview(df1_pivot_bike_lanes, lwd = 2, zcol ="protection_of_bikelane",burst= TRUE, alpha = 0.8, layer.name = "bike_lanes", legend = TRUE) +
  mapview(sf_Sept_df1_pivot_bike_crashes, zcol = "injuries", burst= TRUE, col.region = pal, lwd = 2, layer.name = "bike_crashes", legend = TRUE)

Appyling Code Back to a Larger Set of Crash Data

After reducing the crash data, we were able to get our code to work, thus we will plot another map including more observations from the crash data.

mapview(dc_shape, lwd=2, alpha = 0.5, alpha.regions = 0.2) + 
  mapview(df1_pivot_bike_lanes, lwd = 2, zcol ="protection_of_bikelane",burst= TRUE, alpha = 0.8, layer.name = "bike_lanes", legend = TRUE) +
  mapview(sf_df1_pivot_bike_crashes, zcol = "injuries", burst= TRUE, col.region = pal, lwd = 2, layer.name = "bike_crashes", legend = TRUE)

Trying a Different Plotting Method - Starting with a Smaller Data Set

tmap_mode("view")
## tmap mode set to interactive viewing
sf_Sept_df1_pivot_bike_crashes %>%
tm_shape()+
  tm_facets("injuries")+
  tm_dots("injuries", legend.show = FALSE)

Applying the following code to the entire data set: See the division of injuries~

sf_df1_pivot_bike_crashes %>%
tm_shape()+
  tm_facets("injuries")+
  tm_dots("injuries", legend.show = FALSE)

Applying the same code to the bike lane data:

df1_pivot_bike_lanes %>%
tm_shape()+
  tm_facets("protection_of_bikelane")+
  tm_dots("protection_of_bikelane", legend.show = FALSE)

Creating a Plot to Show a Clearer Comparison of Bike Lanes vs. Injury Type

First, we will attempt to join the bike lane and crash data to create some sort scatter or bar graph to numerical represent this relationship.

Starting off with the smaller September bike crash data set and grouping the data based on the street name. Then, we will create a new variable that counts the number of accidents per street.

test_Sept_df1_pivot_bike_crashes = Sept_df1_pivot_bike_crashes %>% 
  count(nearestintstreetname) %>%
  rename(number_of_accidents = n)

Since this code worked for the smaller data set, we will apply it to the larger crash data. Also, NA, unknown values, alleys, and driveways will be removed as we are only concerned with the streets. We are focus primarily on the interaction of bikes on the roads.

accidents_df1_pivot_bike_crashes = df1_bike_crashes %>%
   count(nearestintstreetname) %>%
  rename(number_of_accidents = n) %>%
  rename(routename = nearestintstreetname) %>%
  drop_na() %>%
  filter(!grepl("Not Found|Alley|Driveway",routename))

Moving on to the bike lane data. Grouping the route name together and counting the number of lane protection for each route. Removing conventional bike lanes from the data set, we want to see if there are less accidents if there are protected lanes. Need to make sure that the data set is a data frame and not a sf object.

df2_pivot_bike_lanes = csv_bikes_lanes %>%
  pivot_longer(col=c("bikelane_parkinglane_adjacent","bikelane_throughlane_adjacent","bikelane_pocketlane_adjacent"),
  names_to = "adjacent_to_bikelane",
  values_to = "direction_of_lane_adjacent") %>%
pivot_longer(col=c("bikelane_conventional","bikelane_dual_buffered","bikelane_protected", "bikelane_buffered"),
  names_to = "protection_of_bikelane",
  values_to = "direction_of_lane_protection") %>%
  dplyr::select(-bikelane_contraflow) %>%
  relocate(c(adjacent_to_bikelane,direction_of_lane_adjacent, protection_of_bikelane, direction_of_lane_protection), .after = streettype) %>%
  mutate(adjacent_to_bikelane = recode(adjacent_to_bikelane, bikelane_throughlane_adjacent = "throughlane", bikelane_pocketlane_adjacent = "pocketlane", bikelane_parkinglane_adjacent = "parkinglane")) %>%
  mutate(protection_of_bikelane = recode(protection_of_bikelane, bikelane_protected = "protected", bikelane_buffered = "buffered", bikelane_conventional = "conventional", bikelane_dual_protected = "dual_protected", bikelane_dual_buffered = "dual_buffered")) %>%
   drop_na(c(direction_of_lane_adjacent, direction_of_lane_protection))
test_df2_pivot_bike_lanes = df2_pivot_bike_lanes %>% 
  filter(!grepl("conventional", protection_of_bikelane)) %>%
  count(routename) %>%
  rename(number_of_protected_lanes = n)

Joining the two data frames together

test1_df2_pivot_bike_lanes = test_df2_pivot_bike_lanes %>%
  left_join(accidents_df1_pivot_bike_crashes, by = "routename")
test1_df2_pivot_bike_lanes[is.na(test1_df2_pivot_bike_lanes)] <- 0

Creating a basic plot to see the relationship between number of accidents and protected lanes.

test1_df2_pivot_bike_lanes %>%
ggplot(aes(number_of_protected_lanes, routename)) +
  geom_bar(stat="identity")

Another Test

test_df1_bike_crashes = df1_pivot_bike_crashes %>%
  rename(routename = nearestintstreetname) %>%
  drop_na() %>%
  filter(!grepl("Not Found|Alley|Driveway",routename)) %>%
  left_join(test_df2_pivot_bike_lanes, by = "routename")

test_df1_bike_crashes[is.na(test_df1_bike_crashes)] <- 0
test_df1_bike_crashes$fromdate <- as.Date(test_df1_bike_crashes$fromdate)
test_df1_bike_crashes$fromdate <- as.POSIXct(test_df1_bike_crashes$fromdate, format = "%Y-%m-%d")
test_df1_bike_crashes %>%
  
ggplot(aes(fromdate, injuries, fill=injuries)) +
  geom_quasirandom(bandwidth = .05,alpha=.2)+
   theme(legend.position = "none")+
  scale_x_datetime(date_breaks="1 month", date_labels="%b")
## Orientation inferred to be along y-axis; override with
## `position_quasirandom(orientation = 'x')`

Generally, as the total number of protected lane decreases by route, the number of accidents increases as shown in the graph below.

test1_df2_pivot_bike_lanes %>%
ggplot(aes(number_of_protected_lanes, routename, fill = number_of_accidents))+
  geom_bar(stat="identity")+
  scale_fill_gradient(low = "antiquewhite2", high="coral4")+
  labs(x="Total Number of Protected Lanes", y = "Route Name", 
       title = "Relationship Between the Number of Protected Lanes and Accident Occurence based on the Bike Route", fill = "Number of Accidents")